    Database integrated analytics using R : initial experiences with SQL-Server + R

    Most data scientists today use functional or semi-functional languages such as SQL, Scala or R to process data obtained directly from databases. This workflow requires fetching the data, processing it, and storing it back, and it tends to happen outside the DB, in often complex data flows. Recently, database service providers have started to integrate "R-as-a-Service" into their DB solutions: the analytics engine is called directly from the SQL query tree, and results are returned as part of the same query. Here we give a first taste of this technology by testing the portability of our ALOJA-ML analytics framework, written in R, to Microsoft SQL Server 2016, one of the recently released SQL+R solutions. In this work we discuss several data-flow schemes for porting a local DB + analytics engine architecture towards Big Data, focusing especially on the new DB Integrated Analytics approach, and we report first experiences with the usability and performance of these new services and capabilities.
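
    The two data flows contrasted above can be sketched from a client's perspective. This is a hedged illustration, not the paper's code: the connection string, the aloja_runs table and the toy models are placeholder assumptions, while sp_execute_external_script is the stored procedure SQL Server 2016 exposes for running R next to the data (it requires R Services and the 'external scripts enabled' option).

import pandas as pd
import pyodbc

# Placeholder connection and table; adjust to a real SQL Server 2016 instance.
conn = pyodbc.connect("DRIVER={ODBC Driver 17 for SQL Server};"
                      "SERVER=localhost;DATABASE=aloja;Trusted_Connection=yes")

# (a) Traditional external flow: fetch, process in the client, store elsewhere.
df = pd.read_sql("SELECT config, exec_time FROM aloja_runs", conn)
df["pred_time"] = df["exec_time"].rolling(10, min_periods=1).mean()  # toy stand-in model
df[["config", "pred_time"]].to_csv("predictions.csv", index=False)

# (b) DB-integrated flow: the R script runs inside the server, fed by
#     @input_data_1, and its OutputDataSet comes back as a normal result set.
in_db = """
EXEC sp_execute_external_script
    @language = N'R',
    @script = N'OutputDataSet <- data.frame(mean_time = mean(InputDataSet$exec_time))',
    @input_data_1 = N'SELECT exec_time FROM aloja_runs';
"""
for row in conn.cursor().execute(in_db):
    print(row)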

    When and How to Apply Statistics, Machine Learning and Deep Learning Techniques

    Machine Learning has become a 'commodity' in engineering and the experimental sciences, much as calculus and statistics did before it. After the hype of the 2000s, machine learning (statistical learning, neural networks, etc.) has settled into a solid and reliable set of techniques, available to the general researcher population for inclusion in their common procedures, far from the mysticism that surrounded the field when only ML experts could solve modeling and prediction problems with such algorithms. But while knowledge in this field has settled among professionals, novice ML users still have trouble deciding when particular techniques can and should be applied to a given problem, sometimes ending up with over-complicated solutions to simple problems, or with complex problems only partially solved by simplistic methods. This tutorial introduces the most common techniques in statistical learning and neural networks, showing the appropriate technique for each given scenario. This project has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595).

    Modeling cloud resources using machine learning

    Cloud computing is a new Internet infrastructure paradigm in which management optimization has become a challenge to be solved, as current management systems are human-driven or ad-hoc automatic systems that must be tuned manually by experts. Managing cloud resources requires accurate information about all the elements involved (host machines, resources, offered services, and clients), and some of this information can only be obtained a posteriori. Here we present the cloud and part of its architecture as a new scenario where data mining and machine learning can be applied to discover information and improve its management through modeling and prediction. As a novel case study, we model basic cloud resources using machine learning, predicting resource requirements from context information such as the amount of load and number of clients, and predicting quality of service from resource planning, in order to feed cloud schedulers. Furthermore, this work is an important part of our ongoing research programme, where accurate models and predictors are essential to optimize autonomic cloud management systems.
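
    A minimal sketch of the kind of predictor described above, using synthetic data: a regressor learns resource demand from context features such as load and number of clients, and a scheduler could query it before placing a workload. The feature names, the synthetic relationship and the choice of a random forest are illustrative assumptions, not the paper's actual models.

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for monitored context data (load, clients) and CPU demand.
rng = np.random.default_rng(0)
n = 2000
load = rng.uniform(0, 100, n)            # e.g. requests per second
clients = rng.integers(1, 50, n)         # concurrent clients
cpu = 0.6 * load + 1.5 * clients + rng.normal(0, 5, n)

X = np.column_stack([load, clients])
X_tr, X_te, y_tr, y_te = train_test_split(X, cpu, random_state=0)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr)
print("R^2 on held-out data:", round(model.score(X_te, y_te), 3))

# A scheduler could then ask for the expected demand of an incoming workload.
print("Predicted CPU for load=80, clients=30:", model.predict([[80, 30]])[0])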

    A resilient and distributed near real-time traffic forecasting application for Fog computing environments

    In this paper we propose an architecture for a city-wide traffic modeling and prediction service based on the Fog Computing paradigm. The work assumes a scenario in which a number of distributed antennas receive data generated by vehicles across the city. Data is collected and processed in local and intermediate Fog nodes, and finally forwarded to a central Cloud location for further analysis. We propose the combination of a data distribution algorithm, resilient to back-haul connectivity issues, with a traffic modeling approach based on deep learning techniques, to provide distributed traffic forecasting capabilities. In our experiments, we leverage real traffic logs from one week of Floating Car Data (FCD) generated in the city of Barcelona by a road-assistance service fleet comprising thousands of vehicles. The FCD was processed under several simulated conditions, ranging from scenarios with no connectivity failures in the Fog nodes to situations with long and frequent connectivity outages. For each scenario, the resilience and accuracy of both the data distribution algorithm and the learning methods were analyzed. Results show that the data distribution process running in the Fog nodes is resilient to back-haul connectivity issues and is able to deliver data to the Cloud location even in the presence of severe connectivity problems. Additionally, the proposed traffic modeling and forecasting method behaves better when run distributed in the Fog than centralized in the Cloud, especially when connectivity issues force data to be delivered out of order to the Cloud. This project is partially supported by the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 639595). It is also partially supported by the Ministry of Economy of Spain under contract TIN2015-65316-P, by the Generalitat de Catalunya under contract 2014SGR1051, by the ICREA Academia program, and by the BSC-CNS Severo Ochoa program (SEV-2015-0493). The authors gratefully acknowledge the Reial Automòbil Club de Catalunya (RACC) for the Floating Car Data dataset provided.
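
    A minimal sketch of the per-node forecasting component, assuming Keras is available and using a synthetic speed series in place of the Floating Car Data; the window length and network size are illustrative assumptions, not the configuration evaluated in the paper.

import numpy as np
import tensorflow as tf

# Synthetic stand-in for one week of 5-minute average speeds at a Fog node.
rng = np.random.default_rng(0)
speed = 50 + 10 * np.sin(np.linspace(0, 60, 2016)) + rng.normal(0, 2, 2016)

window = 12  # use the past hour to predict the next 5-minute average speed
X = np.stack([speed[i:i + window] for i in range(len(speed) - window)])[..., None]
y = speed[window:]

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(window, 1)),
    tf.keras.layers.LSTM(32),
    tf.keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=64, verbose=0)

print("Next-step forecast:", float(model.predict(X[-1:], verbose=0)[0, 0]))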

    Improving Maritime Traffic Emission Estimations on Missing Data with CRBMs

    Maritime traffic emissions are a major concern to governments as they heavily impact air quality in coastal cities. Ships use the Automatic Identification System (AIS) to continuously report position and speed, among other features, so this data is suitable for estimating emissions when combined with engine data. However, important ship features are often inaccurate or missing. State-of-the-art systems such as CALIOPE at the Barcelona Supercomputing Center are used to model air quality, and they can benefit from AIS-based emission models, which are very precise in positioning the pollution. Unfortunately, these models are sensitive to missing or corrupted data and therefore need data curation techniques to significantly improve the estimation accuracy. In this work, we propose a methodology for treating ship data using Conditional Restricted Boltzmann Machines (CRBMs) plus machine learning methods to improve the quality of the data passed to emission models. Results show that we improve on the default methods proposed to cover missing data, and that with our method the models boosted their accuracy to detect otherwise undetectable emissions. In particular, using a real AIS dataset provided by the Spanish Port Authority, we estimate that our method allowed the model to detect 45% additional emissions, representing 152 tonnes of pollutants per week in Barcelona, and we propose new features that may enhance emission modeling. Comment: 12 pages, 7 figures. Postprint accepted manuscript; the full version is available at Engineering Applications of Artificial Intelligence (https://doi.org/10.1016/j.engappai.2020.103793).
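
    A minimal sketch of the conditioning-on-history idea behind the data curation step, with a plain gradient-boosting regressor standing in for the CRBM used in the paper; the synthetic AIS-like features (speed, draught) and the window length are assumptions made purely for illustration.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Stand-in for the CRBM: condition on a short history window of AIS-like
# features to reconstruct a feature that arrives missing or corrupted.
rng = np.random.default_rng(0)
n, window = 5000, 6
speed = np.clip(12 + rng.normal(0, 2, n).cumsum() * 0.05, 0, 25)   # knots
draught = 8 + 0.1 * speed + rng.normal(0, 0.3, n)                  # metres, partly speed-dependent

# Build (history of speed) -> (current draught) training pairs.
X = np.stack([speed[i:i + window] for i in range(n - window)])
y = draught[window:]

model = GradientBoostingRegressor(random_state=0).fit(X[:-500], y[:-500])

# Pretend the last 500 draught reports are missing and impute them.
imputed = model.predict(X[-500:])
print("Mean absolute imputation error (m):", np.abs(imputed - y[-500:]).mean().round(3))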

    Challenges and Opportunities for RISC-V Architectures towards Genomics-based Workloads

    The use of large-scale supercomputing architectures is a hard requirement for scientific Big Data applications. An example is genomics analytics, where millions of data transformations and tests per patient are needed to find relevant clinical indicators. Therefore, to ensure open and broad access to high-performance technologies, governments and academia are pushing for the introduction of novel computing architectures in large-scale scientific environments. This is the case of RISC-V, an open-source and royalty-free instruction-set architecture. To evaluate such technologies, here we present the Variant-Interaction Analytics use case, its benchmarking suite, and datasets. In this use case we search for possible genetic interactions using computational and statistical methods, providing a representative case of heavy ETL (Extract, Transform, Load) data processing. Current implementations run on x86-based supercomputers (e.g. MareNostrum-IV at the Barcelona Supercomputing Center (BSC)), and future steps propose RISC-V as part of the next MareNostrum generations. We describe the Variant Interaction use case, highlighting the characteristics that leverage high-performance computing, and indicate the caveats and challenges for the RISC-V developments and designs to come, based on a first comparison between x86 and RISC-V architectures on real variant-interaction executions over real hardware implementations.
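
    A minimal sketch of the kind of pairwise variant-interaction scan the use case performs, on synthetic genotypes and a case/control label; the data shape and the chi-square statistic are illustrative assumptions and stand in for the project's actual statistical pipeline.

from itertools import combinations
import numpy as np
from scipy.stats import chi2_contingency

# Synthetic genotype matrix (0/1/2 minor-allele counts) and case/control labels.
rng = np.random.default_rng(0)
n_samples, n_snps = 500, 20
genotypes = rng.integers(0, 3, size=(n_samples, n_snps))
phenotype = rng.integers(0, 2, size=n_samples)

results = []
for i, j in combinations(range(n_snps), 2):
    # Collapse the pair into a joint genotype code and cross-tabulate with the phenotype.
    joint = genotypes[:, i] * 3 + genotypes[:, j]
    table = np.zeros((9, 2))
    for g, p in zip(joint, phenotype):
        table[g, p] += 1
    table = table[table.sum(axis=1) > 0]        # drop empty genotype rows
    chi2, pval, _, _ = chi2_contingency(table)
    results.append((i, j, pval))

# The full-scale workload repeats this for millions of pairs, which is the
# combinatorial, test-heavy ETL load described in the abstract.
results.sort(key=lambda r: r[2])
print("Most associated pairs:", results[:3])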

    The holistic perspective of the INCISIVE project : artificial intelligence in screening mammography

    Finding new ways to cost-effectively facilitate population screening and improve early-stage cancer diagnosis, supported by data-driven AI models, provides unprecedented opportunities to reduce cancer-related mortality. This work presents the INCISIVE project initiative towards enhancing AI solutions for health imaging by unifying, harmonizing, and securely sharing scattered cancer-related data, in order to ensure the large datasets that are critically needed to develop and evaluate trustworthy AI models. The solutions adopted by the INCISIVE project are outlined in terms of data collection, harmonization, data sharing, and federated data storage, in compliance with legal, ethical, and FAIR principles. Experiences and examples feature breast cancer data integration and mammography collection, indicating the current progress, challenges, and future directions.
